Mapping of Sequence Reads to the Reference Genomes    ◾    63

2.3  READ SEQUENCE ALIGNMENT AND ALIGNERS

In read mapping, we typically have millions of reads produced by a high-throughput
instrument, and we wish to determine the origin of each read in the sequence of a
reference genome. For most sequencing applications, read alignment (mapping) is the
slowest and most computationally expensive step, because the mapping program must
determine the most likely point of origin of every read with respect to the reference
genome. Mapping reads produced by eukaryotic RNA-Seq requires extra effort from
aligners. In eukaryotes, the coding regions (exons) of genes are separated by non-coding
regions (introns). Since RNA sequencing targets only the transcriptome (gene
transcripts), a read may span the junction between two exons that lie far apart in the
genome. Aligners used for mapping RNA-Seq reads must therefore be aware of the
non-contiguous nature of exons and must meet the challenge of detecting splice junctions.
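
The two points made above can be illustrated with a minimal sketch. It is not how any production aligner works: it scans every ungapped placement of a read along the reference and keeps the one with the fewest mismatches, which shows why mapping without an index is costly (the work per read grows with the length of the reference). The function names `hamming` and `map_read_naive`, the `max_mismatches` parameter, and all sequences are hypothetical and chosen only for illustration. Positions are 0-based.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read_naive(read, reference, max_mismatches=2):
    """Try every ungapped placement; return (position, mismatches) of the best, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        mm = hamming(read, reference[pos:pos + len(read)])
        # Keep the first placement with the lowest mismatch count.
        if mm <= max_mismatches and (best is None or mm < best[1]):
            best = (pos, mm)
    return best


# Hypothetical toy reference and reads (not real genomic data).
reference = "ACGTTGCATGCCGATAGCTAGGCTTAACG"
for read in ["GCCGATAG", "GCTAGCCT", "TTTTTTTT"]:
    print(read, map_read_naive(read, reference))

# Hypothetical toy gene: two exons separated by an intron in the genome.
exon1, intron, exon2 = "ACGTACGGAT", "GTAAGTTTTCCCTTTCAG", "CCATGGCTAA"
genome = exon1 + intron + exon2
transcript = exon1 + exon2            # the intron is removed by splicing
junction_read = transcript[5:15]      # read spanning the exon-exon junction

# The whole read has no contiguous exact placement in the genome ...
print(map_read_naive(junction_read, genome, max_mismatches=0))
# ... but each half maps exactly, on either side of the intron.
print(map_read_naive(junction_read[:5], genome, max_mismatches=0))
print(map_read_naive(junction_read[5:], genome, max_mismatches=0))
```

Splitting the read into two fixed halves is only a device to make the junction visible. It assumes the junction falls exactly in the middle of the read, which a real splice-aware aligner cannot assume.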

Before performing alignment, we need to download the reference genome sequence of the
species studied and then index it so that the locations to which reads map can be found
quickly during searching and alignment. For most aligners, indexing the reference genome
is the first step before read mapping. Above, we discussed the most commonly used
indexing methods for storing and organizing the reference genome sequence so that it can
be searched efficiently to determine the locations (coordinates) of aligned reads.
Aligners face another challenge: the reads produced by sequencers may not align exactly
to a location in the reference genome sequence, either because of base-calling errors or
because of naturally occurring mutations (substitutions, deletions, or insertions) in
the DNA sequences of the individual

FIGURE 2.12  (a) BWT, (b) rank table, and (c) lookup table.